Abstract

Information Integration is a young and exciting field with enormous
research and commercial significance in the new world of the Information
Society. It stands at the crossroads of Databases and Artificial Intelligence,
requiring novel techniques that bring together different methods from these
fields. Information from disparate heterogeneous sources, often with no a
priori common schema, needs to be synthesized in a flexible, transparent and
intelligent way in order to respond to the demands of a query, thus enabling
a more informed decision by the user or application program. The field,
although relatively young, has already found many practical applications,
particularly for integrating information over the World Wide Web. This
paper gives a brief introduction to the field, highlighting some of the main
current and future research issues and application areas. It attempts to
evaluate the current and potential role of Computational Logic in this field and
suggests some of the problems where logic-based techniques could be used.

1 Introduction 

Over the years a vast and diverse amount of data has been collected or created 
and stored by different users for different purposes. In the last decade it has been 
realized that new information can be extracted from this data by synthesizing, in an
advanced (sometimes also called intelligent) way, the available information
from the initial separate sources. Such synthesis or integration of information 
would facilitate and enable high-level tasks such as planning and decision making. 

The problem of information integration has its origins in the problems of mul- 
tidatabases, federated databases and interscheme extraction for global schema 
construction. Recently, the need to address this problem has been further em- 
phasized with the explosion in the amount of information that is now available 
on-line over the information networks such as the Internet and the World Wide 
Web. This development has given new dimensions to the task of information inte- 
gration requiring the integration of non-structured, highly distributed autonomous 
and dynamic information sources. A flexible and scalable strategy for integrating 
these disparate information sources while respecting their autonomy is required. 
At the same time it has brought with it an enormous commercial potential that 
is already helping to drive research in the field. 

The general task of Information Integration can be simply stated, in an infor- 
mal way, as that of "getting information sources to talk to each other" in order 
to support some higher level goal. We want to achieve integration of information 
without integrating the information sources themselves. Thus an information in- 
tegration system must provide a uniform interface to information sources allowing 
the user or application program to focus on specifying what they require rather 
than specifying how or where to find the information. Some of the basic technical 
problems that need to be addressed by such a system are: modeling of the infor- 
mation content, query reformulation with flexible selection of information sources 
and query optimization. For many of these problems it is possible to employ,
together with database techniques, other techniques from Artificial Intelligence
(AI), and it was soon realized that the field of Information Integration lies at the
intersection of Databases with Artificial Intelligence and other areas such as
Information Retrieval and Human Computer Interaction. This in turn has provided
an excellent application domain for (weak) AI.

One of the major difficulties of information integration stems from the fact 
that the information sources involved employ different data models and are het- 
erogeneous both semantically and syntactically. In addition, it is envisaged that 
information integration systems would be able to support in an integral way
advanced applications exploiting the richness of the information made available by
these systems. These two factors necessitate rich formalisms for semantic in- 
terpretation of the available information so that it can be understood by the 
other sources and the application programs. We are thus led to the need for se- 
mantic (or intelligent) information integration where, along with techniques from 
databases, Artificial Intelligence and Computational Logic can play a significant 
role in providing explicit and declarative representation frameworks for modeling 
the information and the complex behaviour of the system. The autonomous
nature of the information sources, where we may only be able to represent partial
information and where information from different sources may be overlapping or
even inconsistent, further widens the scope for applying techniques from the areas of
Artificial Intelligence and Computational Logic.

In particular, a declarative logical representation can form the conceptual ba- 
sis for the basic architecture of Information Integration systems. The primary 
role of Computational Logic transpires to be that of providing an overall coop- 
eration layer that would link, through suitable forms of inference, the different 
information sources into an intermediate layer. This intermediate layer is called,
in the Information Integration literature, a mediator. Within this logical framework
it would also be easy and natural to employ specific Computational Logic
methods for particular subtasks required inside an information integration sys- 
tem. A hybrid computational model emerges where logic and its reasoning offer 
the computational link between high-level complex requirements and lower-level 
(typically non-logical) computational tasks such as searching, constraint solving, 
retrieval and various forms of manipulation of data. 

The rest of the paper is organized as follows. Section 2 briefly reviews the
main elements of Information Integration Systems and the current role of logic
in them. Section 3 presents some of the Information Integration systems which
use (to a varying extent) logic and discusses the current scope of application of
such systems. Section 4 presents some of the future problems in the area and the 
challenges for Computational Logic stemming from these. The paper concludes 
with a short appraisal of the potential role of Computational Logic in the area. 

Disclaimer: This paper is a survey written mainly for the non-specialist in the
field of Information Integration and does not claim to provide a complete
account of the area and its particular systems and applications. Furthermore, the
presentation of the paper follows a particular view concerned with the possible
role of Computational Logic in this new field, attempting to show links between
existing work in Computational Logic and problems arising in the area of
Information Integration. In some cases, these potential links reflect the authors'
personal opinion.



2 State of the Art 

The first roots of Information Integration can be found in the area of databases
with the problem of constructing multidatabases [132, 56]. This aspired to
form databases where different sites of heterogeneous databases would cooperate
and be updated together. In order to reduce the complexity of the problem the
notion of Federated Databases [177, 186, 124] was proposed, and later on Data
warehousing [185, 193] helped to simplify the problem even more by storing
data in one site.

2.1 The Basic Architecture 

To address the advanced requirements of future information systems, Wiederhold
proposed [194] a general system architecture where a central role is assigned to
software modules called mediators. A mediator is an information system
component that lies between the data sources and the user and her applications. A
mediator "exploits encoded knowledge about the data to create information for
a higher layer of application". Its added value is exactly this transformation of
data to information. The tasks of a mediator according to [196] include:

• accessing and retrieving relevant data from multiple heterogeneous resources,

• abstracting and transforming retrieved data into a common representation
and semantics,

• integrating the homogenized data according to matching keys, and

• reducing the integrated data by abstraction to increase the information
density in the result to be transmitted.

The mediator-based architecture has been to a large extent adopted in the
development of the reference architecture for Intelligent Information Integration
(I3) supported by DARPA. More importantly, the need for an intelligent
middleware between the sources and the user is a common assumption in most of
the work in I3. As already noticed in [194], to perform their functions effectively,
mediators will embody techniques from logical databases and AI. It is inside the
mediator, in its design and implementation, where logic can play an important
role in the information integration systems of the future.
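
The four mediator tasks listed above can be illustrated with a small sketch. The two sources, their field names and the salary figures below are hypothetical and serve only to make the steps concrete; this is not code from any of the systems surveyed here.

```python
from collections import defaultdict

# Two hypothetical sources holding overlapping data in different shapes.
SOURCE_A = [{"emp_id": 1, "full_name": "Ada Lovelace", "dept": "R&D"},
            {"emp_id": 2, "full_name": "Alan Turing", "dept": "R&D"}]
SOURCE_B = [("1", 5200), ("2", 6100), ("3", 4800)]  # (id as text, salary)


def access():
    """Task 1: retrieve the relevant data from each source."""
    return SOURCE_A, SOURCE_B


def homogenize(rows_a, rows_b):
    """Task 2: map both sources to a common representation keyed by integer id."""
    people = {r["emp_id"]: {"name": r["full_name"], "dept": r["dept"]}
              for r in rows_a}
    salaries = {int(key): amount for key, amount in rows_b}
    return people, salaries


def integrate(people, salaries):
    """Task 3: join the homogenized data on the matching key."""
    return [{**info, "id": key, "salary": salaries[key]}
            for key, info in people.items() if key in salaries]


def reduce_by_abstraction(records):
    """Task 4: summarize per department to raise the information density."""
    by_dept = defaultdict(list)
    for rec in records:
        by_dept[rec["dept"]].append(rec["salary"])
    return {dept: sum(vals) / len(vals) for dept, vals in by_dept.items()}


def mediate():
    """Run the four tasks in sequence and return the condensed answer."""
    rows_a, rows_b = access()
    return reduce_by_abstraction(integrate(*homogenize(rows_a, rows_b)))
```

Calling `mediate()` returns one average salary per department instead of the raw rows, which is the "data to information" transformation in miniature.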



2.2 Inside the Mediator 

Among the above mentioned tasks of the intelligent mediator middleware, the
first, often called the information gathering problem, is the most studied. This
task can be decomposed into the following sub-tasks:

• modeling the contents of information sources 

• modeling the relation between sources and the mediator 

• query reformulation (query planning) 

• query optimization and execution 



2.2.1 Modeling and Query Reformulation 

As mentioned above, a mediator needs to support adequately a wide range of tasks
and provide a variety of functionality based on these tasks [196]. In order to
achieve this, mediators need to use models of the various information sources that
are available, referred to as the source model, as well as models of the user's (or the
application's) view of the problem world, referred to as the world model, and to
link these models together. Among these modeling tasks of a mediator the linking
of the world model to the source model is of central importance. In most of the
early studies for developing such models the simplifying assumption is made that
the sources can be modeled as relational databases. Similarly, it is assumed that
the world model is captured by a set of relations. These relations are not stored
in any of the information sources, but are those on which the user (or application
program) will pose queries. In DB terminology, we say that the world model's
relations define a global mediated schema.

Roughly speaking, the relations in the mediated schema are used by the user
(or her/his application program) in order to specify what information s/he
wants, while the source relations determine where the information is stored. To
be able to answer user queries the mediator needs to know how the global schema 
relations are related to those of the information sources. Having this knowledge 
the middle-ware of an information integration system translates queries on the 
mediated schema to queries on the source relations. 

There are basically two approaches to relating sources to the mediated schema.
The first is the local as view approach and the second the global as view approach
[3] (also called source definition and view definition respectively). In the first
approach sources are considered to be materialized views over the mediated schema.
Datalog rules of the form

source_relation :- global_relation_1, ..., global_relation_n

are used to link source relations to the mediated schema, describing in this way
the contents of the source in terms of the global mediated schema. A Datalog
rule like this is called a conjunctive query; it expresses the fact that the query given
by the body of the rule can be answered using the source relation found at the
head of the rule. More generally, query translation can be seen as synthesizing
queries from views [191], a problem that has been studied quite extensively
[55, 66, 119, 127, 124]. Levy [129] provides a thorough survey of this problem of
answering queries using views in different application areas, including information
integration.

In the global as view approach the mediated relations are considered as views
over the source relations. Rules of the form

global_relation :- source_relation_1, ..., source_relation_n

relate the source relations to the global relations. Query translation then takes
the form of simple view unfolding.
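
View unfolding in the global as view setting can be sketched in a few lines. The relation names (`movie`, `review`, `imdb`, `paper_reviews`, `web_reviews`) are invented for illustration, and the sketch assumes that rule bodies contain only variables; when a global relation is defined by several rules, the result is a union of source-level plans.

```python
from itertools import count, product

# Hypothetical GAV definitions: global relation -> list of (head vars, body).
# Atoms are tuples (predicate, arg1, ...); body arguments are all variables.
GAV_RULES = {
    "movie": [(("T", "Y"), [("imdb", "T", "Y", "D")])],
    "review": [(("T", "R"), [("paper_reviews", "T", "R")]),
               (("T", "R"), [("web_reviews", "T", "R", "U")])],
}

_fresh = count()  # supplies suffixes for renaming body-only variables


def unfold_atom(atom):
    """Return one list of source atoms per rule defining the atom's predicate."""
    pred, *args = atom
    expansions = []
    for head_vars, body in GAV_RULES[pred]:
        subst = dict(zip(head_vars, args))
        suffix = next(_fresh)

        def rename(term):
            # Head variables take the query's terms; others get fresh names.
            return subst.get(term, f"{term}_{suffix}")

        expansions.append([(p, *map(rename, a)) for p, *a in body])
    return expansions


def unfold(query_body):
    """Unfold a conjunctive query over the mediated schema.

    The result is a list of alternative plans, each a conjunction of
    source atoms; together they form a union of conjunctive queries.
    """
    per_atom = [unfold_atom(atom) for atom in query_body]
    return [[a for part in combo for a in part] for combo in product(*per_atom)]


# Query: movie(T, 1999), review(T, R)
plans = unfold([("movie", "T", "1999"), ("review", "T", "R")])
```

Here `plans` holds two alternatives, one per definition of `review`, each starting with the same `imdb` atom. No search is involved, which is why query translation is considered simple in this approach.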

Ullman [191] compares the two approaches and identifies their advantages.
The global as view approach makes the query translation problem easier. It can
also draw from work on abduction [108] when we want to increase the expressive
power of the modeling language to include constraints, negation and other
powerful modeling features. On the other hand, the local as view approach simplifies
adding and deleting sources and therefore it is particularly useful in dynamic
environments. Moreover, it facilitates the specification of constraints on the content
of the sources, as we can attach such constraints to the conjunctive query that
describes them. In the global as view approach constraints on the contents of
a source can be expressed separately as meta-level integrity constraints on the
source relations.



Among the existing systems, TSIMMIS [88], Disco [187], and COIN [2, 6, 27]
adopt the second approach while Information Manifold [128, 124] and Infomaster
[64] follow the first. Moreover, recently there have been systems that combine
the two approaches [85]. We note here that description logics are also used as
the underlying formalism for mediator modeling in some information integration
systems, including SIMS [9] and PICSEL. The general role of description
logic in information integration has been studied in [37, 39], and work on
answering queries using views has also been carried out in the context of
description logic.

Once the mediator translates user queries to source relations, it has an
information gathering plan, which in abstract terms can be seen as a conjunction over
the source relations. At this stage the mediator knows which information
sources it needs to access in order to answer the query. It must then retrieve
the information from the sources by sending requests to them in a language that
they understand. More importantly, the mediator should not post to the sources
queries that are beyond their answering capabilities. Consequently the mediator
must also model the query constraints of the sources. An important family of such
constraints are binding restrictions, i.e. information that specifies that a certain
source can only answer queries that have some of their attributes bound. Query
processing systems that take into account source capabilities are described
in [84, 89, 168, 173, 188, 202, 201].
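
The effect of binding restrictions on a plan can be made concrete with a small sketch. It assumes that each source has a single binding pattern written as a string of `b` (must be bound) and `f` (may be free), and follows the Prolog convention that variables start with an upper-case letter; the source names are hypothetical.

```python
# Hypothetical capability descriptions, one binding pattern per source.
BINDING_PATTERNS = {
    "authors_at": "bf",        # authors_at(Institution, Author)
    "papers_by_author": "bf",  # papers_by_author(Author, Paper)
    "citations": "bf",         # citations(Paper, CitingPaper)
}


def is_variable(term):
    """Variables start with an upper-case letter; anything else is a constant."""
    return term[:1].isupper()


def executable_order(atoms, patterns):
    """Order the atoms so each source is called with its required arguments bound.

    Returns the reordered plan, or None when no executable order exists.
    The set of bound variables only grows, so picking any callable atom
    at each step is sufficient under the single-pattern assumption.
    """
    bound, ordered, remaining = set(), [], list(atoms)
    while remaining:
        for atom in remaining:
            pred, *args = atom
            required = [a for a, mode in zip(args, patterns[pred]) if mode == "b"]
            if all(not is_variable(a) or a in bound for a in required):
                ordered.append(atom)
                bound.update(a for a in args if is_variable(a))
                remaining.remove(atom)
                break
        else:
            return None  # every remaining atom needs an unbound argument
    return ordered


plan = [("citations", "P", "C"),
        ("papers_by_author", "A", "P"),
        ("authors_at", "mit", "A")]
ordered_plan = executable_order(plan, BINDING_PATTERNS)
```

For this plan the only executable order starts from the atom with the constant `mit`, then uses the authors it returns to query papers, and finally citations. A plan consisting of `citations(P, C)` alone has no executable order and yields `None`.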



Another important feature of a mediator at this stage is its capability to model
various forms of local completeness of the sources, that is, expressing the extent
to which a source is complete for the domain that it covers. This information can
help the system to restrict and improve access to the sources [70, 36, 122].



2.2.2 Query Optimization and Execution 

Apart from being sound (the answers returned to the user are correct) and
complete (all correct answers to the query are found), information gathering plans
also need to be efficient. Query optimization in the context of information integration
has several aspects. These can be divided into logical optimization and
execution optimization. Logical optimization [119] seeks to eliminate redundancies
from the information gathering plans, possibly by exploiting local completeness
information [122]. As in a traditional DB system, the resulting
logically optimized plan is passed to a query optimizer that has to come up with
an efficient query execution plan to achieve execution optimization. While this
is a well-studied problem in relational DB theory, information integration adds
extra complexity [119, 123].



Garlic is one of the early integration systems that address logical and execution
optimization by extending previous work in relational DB [99]. More recent
work in the field is concerned with using local completeness and source overlap
information for ordering source accesses in order to maximize the likelihood of
obtaining answers as early as possible, and to minimize the access cost and the
network traffic [52, 119, 190]. Another way to consider optimization is to add new
sub-queries to the query plan specifically for the purpose of optimization.

The dynamic nature of the environment in which information integration systems
operate requires that their query execution engines have advanced capabilities.
Information integration over a network like the Internet has to take into
account that sources can become slow or unavailable during the execution of an
information gathering plan. Query planning needs to be interleaved with query
execution, and the system must be able to re-plan and re-optimize [112, 106].



2.3 Source Structure and Heterogeneity 

The previous discussion makes the simplifying assumption that the contents of
the information sources can be described by a set of relations. However, there
are two separate problems with this approach [121]. The first is that sources
can be physically semistructured, where the structure of the source information
is obscured by the way that this information is stored; e.g. in HTML file sources
the data is often not stored directly in its structured form due to additional markup
information used to make the source human readable. The second is that data
can be logically semistructured, in the sense that the data does not necessarily fit
in a schema. Data can be so irregular that the schema can be very large.

To cope with physically semistructured sources we add wrappers to them, while
the problem of integrating logically semistructured sources is tackled through
flexible data models and query languages. We discuss these important issues in
the following two subsections.



2.3.1 Wrapper Technology 

Sources often do not provide their data in a format, e.g. relational tuples, that
can be directly manipulated by an integration system. For example, if the source
is an HTML page, it can be the case that structured data (e.g. names, phones
and emails of the faculty of a university department) is embedded in natural
language text and graphical presentation objects. In such cases the integration
system communicates with a wrapper, i.e. a program whose task is to translate
the data into a form that can be further processed by the integration system, e.g. by
removing all HTML markup information. Since sources usually store their data in
different formats, each source often needs a different wrapper. However, writing
wrappers can be a tedious task, therefore wrapper generation is an important
issue. Wrapper generation can be semi-automatic, when support tools are used
in their generation, or automatic, using machine learning techniques. The work
of [68, 153] surveys the field and discusses the relation of wrapper generation to the
more general problem of information extraction¹. We should note however that the
emergence of XML as a new standard for representing data in a machine readable
way (in contrast to HTML which focuses on human readable representations) is
expected to significantly mitigate the problem of physical data heterogeneity.
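
A hand-written wrapper for the faculty example can be sketched with Python's standard `html.parser` module. The page content and the `FacultyWrapper` class are hypothetical; the sketch assumes one table row per faculty member, with name, phone and email cells in that order.

```python
from html.parser import HTMLParser

# A hypothetical department page: the tuples are buried in presentation markup.
PAGE = """
<html><body><h1>Faculty</h1>
<table>
<tr><td><b>Ann Smith</b></td><td>555-0101</td>
    <td><a href="mailto:ann@cs.example.edu">ann@cs.example.edu</a></td></tr>
<tr><td><b>Bob Jones</b></td><td>555-0102</td>
    <td><a href="mailto:bob@cs.example.edu">bob@cs.example.edu</a></td></tr>
</table></body></html>
"""


class FacultyWrapper(HTMLParser):
    """Turns each table row into a (name, phone, email) tuple."""

    def __init__(self):
        super().__init__()
        self.rows = []     # extracted tuples
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside table cells (headings, whitespace) is discarded.
        if self._cell is not None:
            self._cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(tuple(self._row))
            self._row = None


def wrap(page):
    """Return the relational tuples hidden in the HTML page."""
    parser = FacultyWrapper()
    parser.feed(page)
    parser.close()
    return parser.rows
```

`wrap(PAGE)` yields two tuples that an integration system can treat as rows of a relation. The wrapper is tied to this page layout, which is why each source tends to need its own wrapper and why wrapper generation matters.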



¹ An information repository on information extraction and wrapper generation is maintained
at http://www.isi.edu/~muslea/RISE/.



2.3.2 Semistructured Data

There are many interesting cases where the data cannot be constrained by a rigid
schema. Moreover, in an information integration application where data needs
to be exchanged and transformed among sources with different data models (e.g.
relational and object oriented) we require very flexible data formats. These needs
have led to the new research problem of modeling and querying semistructured
data [1, 184]. While there is no strict definition of the term, semistructured
data is often described as "schemaless" or "self-describing" data, "terms that
indicate that there is no separate description of the type or structure of data".
Semistructured data is captured by flexible data models like the OEM model of
the TSIMMIS project or, more abstractly, by some form of edge labeled
directed graphs. All schema information moves to the data itself, i.e. to the
graph, in the form of the labels attached to the nodes or edges of the graph (hence
the term self-describing).

Flexible query languages like Lorel [145] and UnQL [32] provide several
required functionalities such as (i) the ability to query without knowing the exact
structure of the data, (ii) the provision of navigation queries in the style of Web
browsing and (iii) the ability to query both the data and the schema at the same
time. The problem of answering queries using views in the context of semistructured
data has also attracted attention recently [42, 169]. Motivated by the
importance of schema information for efficient processing, storage and usability of
data, recent research attempts to bring back some forms of structure or schema to
semistructured data. Several schema formalisms have been considered in different
approaches to this problem of extracting schema from semistructured data, e.g. in
[33, 95, 158, 157]. Logical approaches to schema formulation from semistructured
data include the use of Datalog rules [156], description logics and mappings
into the relational database model.
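
The idea of self-describing data and of querying without knowing its exact structure can be sketched with nested Python dictionaries standing in for an edge labeled graph. This is neither OEM nor Lorel syntax; the restaurant data and the function names are invented for illustration.

```python
# Irregular data: the two restaurants do not share a common structure.
DB = {
    "restaurant": [
        {"name": "Chez Nous",
         "address": {"street": "Main St", "city": "Nicosia"}},
        {"name": "Taverna", "address": "Old Port",
         "phone": ["22-1111", "22-2222"]},
    ]
}


def children(node):
    """Yield (label, child) pairs, i.e. the outgoing labelled edges of a node."""
    if isinstance(node, dict):
        for label, value in node.items():
            for child in value if isinstance(value, list) else [value]:
                yield label, child


def descendants(node, label):
    """All nodes reached through an edge carrying `label`, at any depth."""
    for edge, child in children(node):
        if edge == label:
            yield child
        yield from descendants(child, label)


def label_paths(node, prefix=()):
    """The set of label paths that occur: schema information read off the data."""
    paths = set()
    for edge, child in children(node):
        paths.add(prefix + (edge,))
        paths |= label_paths(child, prefix + (edge,))
    return paths
```

`descendants(DB, "city")` finds the city without the caller knowing where addresses sit in the graph, in the spirit of functionality (i), while `label_paths(DB)` queries the structure itself, as in (iii).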

The semistructured data model represents a new paradigm and challenge in
databases and therefore in Information Integration. The database community has
already made significant progress towards linking semistructured data research to
emerging standards for information exchange like XML [45]. For instance,
several of the query languages developed for XML [174, 22] bear a strong
resemblance to query languages for semistructured data. A recent book offers
an extensive coverage of this research theme.



2.4 The Semantic Web



Recently, it has been realized that information integration over the Web can
benefit significantly from adding extra semantic structure over these information sources,
namely the web documents or web pages. This need for semantic information links
information integration with the Semantic Web, a term coined by Tim
Berners-Lee. The Semantic Web is a vision for the next stage of the Web, described
as a space of self-describing, machine-processable or "machine-understandable"
fragments of information in which documents convey meaning. In contrast to the
existing Web, whose content information is only human-readable, the Semantic
Web aims to develop the technologies that will enable the machines themselves to
make more sense out of it.

The Semantic Web² is envisaged to have a multi-layer architecture which is
based on the XML standard for semi-structured Web data and related technologies
(e.g. Namespaces, XML-Schema) that support inter-operability mainly on a
syntactic level. On top of XML lies the Resource Description Framework (RDF)
[120], a key concept of the Semantic Web. Essentially, RDF is a simple data
model for expressing metadata in the form of object-property-value triples, called
statements. In RDF, statements can themselves be the object or value of a triple,
allowing in this way the nesting of statements.
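
The triple model and the nesting of statements can be sketched with a small Python data structure. The resource names (`page:home`, `person:ora`, `ex:claims`) are made up, and the sketch models nesting directly rather than through any particular RDF serialization or library.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """An object-property-value triple; the value may itself be a Statement."""
    subject: str
    predicate: str
    obj: object


# "The creator of page:home is person:ora."
s1 = Statement("page:home", "dc:creator", "person:ora")
# "person:ralph claims that the creator of page:home is person:ora."
s2 = Statement("person:ralph", "ex:claims", s1)

graph = {s1, s2}


def nested(stmt):
    """Yield the statement followed by any statements nested in its value."""
    yield stmt
    if isinstance(stmt.obj, Statement):
        yield from nested(stmt.obj)


def values(graph, subject, predicate):
    """All values recorded for a given subject and property."""
    return {s.obj for s in graph
            if s.subject == subject and s.predicate == predicate}
```

`values(graph, "person:ralph", "ex:claims")` returns the statement `s1` itself, so metadata about metadata is expressed with the same triple structure.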

While RDF provides the model for describing resources and relationships
between them, it does not support any mechanism for declaring properties or
relationships between properties and other resources. For this RDF Schema (RDFS)
[28], a schema specification language, is used. RDFS provides a basic type system,
or vocabulary, for use in RDF models. Important modeling primitives it provides
include class/subclass, instance-of, and range/domain restrictions. However, we
should note that W3C recommendations follow a minimalistic approach, so neither
RDF nor RDFS provides any formal semantics for the constructs they define, nor
do they provide an inference system or a query language. This lack has motivated
a number of works that attempt to relate RDF and RDFS to formalisms with clear
semantics like conceptual graphs [52] or logical languages [57, 20, 29, 51, 140].
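
To indicate what relating RDFS to a logical language might look like, the sketch below reads three of its primitives as Datalog-style rules and saturates a set of triples under them. The choice of rules is an assumption made for illustration and is not taken from any of the cited proposals; the `ex:` names are hypothetical.

```python
def rdfs_closure(triples):
    """Saturate a set of (subject, property, value) triples under three rules:

      subClassOf(A, C) :- subClassOf(A, B), subClassOf(B, C).
      type(X, B)       :- type(X, A), subClassOf(A, B).
      type(X, C)       :- P(X, _), domain(P, C).
    """
    facts = set(triples)
    while True:
        new = set()
        for s, p, o in facts:
            if p == "rdfs:subClassOf":
                # Transitivity of the subclass relation.
                new |= {(s, p, o2) for s2, p2, o2 in facts
                        if p2 == p and s2 == o}
                # Instances of a subclass are instances of the superclass.
                new |= {(x, "rdf:type", o) for x, p2, c in facts
                        if p2 == "rdf:type" and c == s}
            elif p == "rdfs:domain":
                # Using property s on x makes x an instance of the domain class.
                new |= {(x, "rdf:type", o) for x, p2, _ in facts if p2 == s}
        if new <= facts:
            return facts  # fixpoint reached
        facts |= new


TRIPLES = {
    ("ex:Professor", "rdfs:subClassOf", "ex:Faculty"),
    ("ex:Faculty", "rdfs:subClassOf", "ex:Person"),
    ("ex:teaches", "rdfs:domain", "ex:Professor"),
    ("ex:ann", "ex:teaches", "ex:logic101"),
}
closure = rdfs_closure(TRIPLES)
```

From the single fact that `ex:ann` teaches a course, the closure derives that she is a professor, a faculty member and a person. This is the kind of inference that the schema vocabulary alone does not define.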

Hence, although RDF/RDFS provide a powerful infrastructure for semantic
inter-operability, they do not suffice to capture the advanced knowledge
representation and reasoning needs of the Semantic Web. The Semantic Web architecture
requires two additional layers, an ontology and a logic layer, that aim at
accommodating richer forms of knowledge. Among the early attempts to employ
ontologies on the Web were Ontobroker [74] (that will be described in some
detail in the following) and the SHOE project [137] that allows authors to annotate
their HTML pages with ontology-based knowledge about the page contents. More
recent approaches are the Ontology Inference Layer - OIL [211] and the DARPA
Agent Markup Language - DAML [212]. OIL [75, 101] is a system that combines
the reasoning capabilities of description logics (more specifically that of FaCT
[103, 104]), with the rich modeling of frame-based systems and the new Web
standards of RDF and XML. The DAML Program [102], which officially began in
August 2000, is an "... effort to develop a language and tools to facilitate the
concept of the semantic web.". The DAML language also builds on top of XML and
RDF, and its latest release called DAML+OIL is available from [213]. Recent
papers, including [76], provide an up-to-date overview of languages, tools and services of
the Semantic Web, and discuss new developments and current issues.

² See www.semanticweb.org for a variety of information about the Semantic Web.



2.5 The Current Role of Logic 

One of the basic requirements of Information Integration frameworks is to allow
the user to specify what information she wants rather than how to get it. Such
systems must rely, at least at some level of abstraction, on a declarative
representation with a rich formalism for describing the available information. Logic
can provide such a flexible representation and thus play an important role in the
development of information integration systems. In the emerging future systems,
where an advanced form of synthesis of information would be facilitated through an
application layer above the mediator layer, Logic and Artificial Intelligence will
have an ever increasing role to play. This wider role of AI in current and emerging
Information Integration systems has recently been presented in a survey paper.

Currently, the central role of logic rests on the fact that it provides a framework
for specifying the mediator layer of an Information Integration system and the
relation between the mediators and source information. Logic as a Knowledge
Representation formalism facilitates this and provides (in some cases) techniques for
the computational process of the Information Integration system. Most systems
indeed either use logic explicitly (e.g. COIN or Infomaster) as a KR technique
or can be cast into this form, e.g. the MSL language of the TSIMMIS project. The
work of [191] provides a survey of this role of logic, showing how both the global as
view and the local as view approaches for modeling the information sources and
their relation to the mediator layer can be understood in terms of logical views
over Datalog programs. The extraction and integration of information by these
systems is then seen as answering conjunctive queries over these programs.

A third approach to modeling the information in an Information Integration
system is through the use of Description Logics³ (e.g. in Information Manifold,
PICSEL, SIMS). This is used as a richer representational framework that can
capture additional (meta-level) knowledge on the sources, available primarily as
constraints on the information in these sources. For example, the PICSEL system
[53], which is based on the Description Logic language CARIN, is able to combine
the global as view and local as view approaches and to use the generic inference
mechanisms of this logic. In particular, it avoids the problem of reformulation of
the query required in the local as view approach.

At the level of computational methods used in information integration systems,
the problem of query answering can be directly related to logical inference for the
reduction of the query to the mediator layer (often this is trivial) and then the
further reduction of this to the information sources. Indeed, in the global as view
approach one way to formalize this reduction is as a simple form of abductive
reasoning [109]. In the local as view approach this can be seen as a problem of
reformulating a query in terms of (materialized) views and in turn this is closely
related to the problem of query containment in Datalog. Logic helps to study
the theoretical issues concerning the question of to what extent it is possible to do
this reformulation. In Infomaster this query reformulation is achieved by first
inverting the local as view description of sources in terms of mediator relations
and then reducing, as in the global as view approach, the original query using
logical inference techniques. Note that this inversion can
take us outside Datalog, with disjunction and recursion required.
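
Query containment, to which reformulation in the local as view approach is related, has a simple characterization for conjunctive queries without built-in predicates: Q1 is contained in Q2 exactly when there is a containment mapping from Q2 into Q1. The sketch below tests this by backtracking search; the query encoding and the function names are illustrative assumptions.

```python
def is_var(term):
    """Variables start with an upper-case letter; other terms are constants."""
    return term[:1].isupper()


def extend(mapping, src, dst):
    """Extend `mapping` so that atom `src` is sent onto atom `dst`, else None."""
    if src[0] != dst[0] or len(src) != len(dst):
        return None
    mapping = dict(mapping)
    for s, d in zip(src[1:], dst[1:]):
        if is_var(s):
            if mapping.setdefault(s, d) != d:
                return None  # s is already mapped to a different term
        elif s != d:
            return None      # constants must be preserved
    return mapping


def contained_in(q1, q2):
    """True if q1 is contained in q2 (every answer of q1 is an answer of q2).

    A query is a pair (head atom, list of body atoms). The test looks for a
    mapping of q2's variables that sends q2's head to q1's head and every
    body atom of q2 to some body atom of q1.
    """
    (head1, body1), (head2, body2) = q1, q2

    def search(mapping, atoms):
        if not atoms:
            return True
        for target in body1:
            extended = extend(mapping, atoms[0], target)
            if extended is not None and search(extended, atoms[1:]):
                return True
        return False

    start = extend({}, head2, head1)
    return start is not None and search(start, body2)


# Q1: q(X) :- edge(X, Y), edge(Y, X).     Q2: q(X) :- edge(X, Y).
Q1 = (("q", "X"), [("edge", "X", "Y"), ("edge", "Y", "X")])
Q2 = (("q", "X"), [("edge", "X", "Y")])
```

Here `contained_in(Q1, Q2)` holds because Q1 only adds a condition to Q2, while the converse fails. The search is exponential in the worst case, in line with the known NP-completeness of conjunctive query containment.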



³ Description logic has also been used [166] in the related problem of interscheme extraction
from database schemes.

Deductive databases with their strong logical basis can also play an important
role in information integration, at the analysis and design level and also in the
actual implementation. Indeed, Datalog-based formalisms are used by most
approaches that build integrated systems and mediators for information taken from
multiple heterogeneous data sources such as legacy databases, external views, and web
sites [123, 125]. Furthermore, the query optimization problem for these systems
is frequently solved by means of annotated plans where rules and goals are
assigned bound/free annotations. This is the very approach developed and used by
deductive database compilers and optimizers [173, 126]. The Deductive Database
system LDL++ [203] has been used as a component in the (early) information
integration systems Carnot [155, 162] and Infosleuth [160]. LDL++
also proved very useful in an assortment of other tasks needed for information
integration, such as the task of cleaning and converting legacy data [178].

In approaches that use description logics query processing is based in a general 
way on the underlying inference algorithm of the logic. This has the advantage 
that the development of a new application domain amounts to the design of a 
suitable knowledge base without the need to develop a specialized query engine. 
The description logic approaches also have a high degree of modularity. This 
stems from the fact that the representation of the sources is separated into two 
parts: (i) a theory that describes the relation between mediator and sources and 
(ii) declaratively stated integrity constraints that capture additional knowledge 
about the information in the sources. 

The COIN framework [25] adopts a similar approach for its representation of
the domain using integrity constraints to help resolve semantic conflicts between 
information that can be acquired from different sources, thus giving a form of 
semantic query optimization. This system is implemented on the ECLiPSe parallel 
constraint logic programming environment and relies on techniques of abduction 
and constraint propagation. 

It is important to note, though, that despite the strong logical flavour of some
of these information integration frameworks, many of the techniques used in
these systems are specific to the particular needs of each system. In
particular, the problems of query optimization, source capabilities, local
completeness of sources and use of other meta-level information about the
source information are typically handled with specific techniques that vary
from system to system.

Some of these techniques, e.g. for query planning, take input from AI and
Logic (and deductive databases), but there are no general methods for exploiting
logical techniques in these important aspects of the general problem of
information integration. Computational Logic needs to develop its own specific
techniques suited for the particular problems of information integration. These 
techniques need not be (and in many cases should not be) entirely logical but they 
should have a hybrid nature of co-operation between logical reasoning and other 
computational methods. 

The working assumption of much of the early research is that the information
stored in the sources can be captured by a set of relations, i.e. it is assumed
that sources are relational and can be described by some rigid schema. This
assumption is not valid when working with semistructured data || f|, [7£| and
there is currently a strong shift away from it. It is therefore important for
logical approaches to be able to represent semistructured information sources
adequately. There are some studies in this direction, where a logical theory
is used to represent the labeled graph data model for semistructured data. The
FLORID approach [136], described in more detail below, uses F-Logic, a logical
framework that combines techniques from object-oriented databases with the
power of deductive rules for expressing complex queries. In [35], [36] the
authors propose description logics as a logical formalism capable of capturing
the labeled directed graph model of semistructured data and extending it with
various forms of constraints and the capability to deal with incomplete
information. Finally, the MOMIS [15 approach, which also builds on work in
description logics, provides features that allow the representation of some
forms of semistructured data.



3 Information Integration in Practice 

Information integration has important practical applications, particularly in
view of the emerging need to integrate information distributed over the WWW.
Several practical systems of Intelligent Information Integration have been
developed. General information on these can be found at the URL addresses
[214], [215], [216]. We present here some of these systems whose development
has to a certain extent been influenced by Computational Logic at the
conceptual and/or the implementation level. Other systems that make use of
methods from Computational Logic, but to a lesser degree, include the early
systems of Information Manifold gH, TSIMMIS ||, SHOE Q, SIMS @ and WHIRL g9 .



COIN (COntext INterchange) [26], [27] is an Information Integration framework
with emphasis on resolving problems arising from semantic heterogeneity, i.e.
inconsistencies arising from differences in the representation and
interpretation of data. This is accomplished using three elements: a shared
vocabulary for the underlying application domain (in the form of a domain
model), a data model (COIN), and an application/query language (COINL).
Semantic inter-operability is accomplished by making use of declarative
definitions corresponding to source and receiver contexts, i.e. constraints,
choices, and preferences for representing and interpreting data. The
identification and resolution of potential semantic conflicts involving
multiple sources is performed automatically by the context mediator. Users and
application developers can express queries in their own terms and rely on the
context mediator to rewrite the query in a disambiguated form.

The COIN data model is a deductive and object-oriented model of the family of
F-Logic [111]. The mediation engine's main inference is implemented by means
of a resolution-based abductive procedure in the Constraint Logic Programming
language of ECLiPSe [67]. Queries and subqueries are represented by the
successive states of a constraint store along one branch of the resolution
tree. Integrity constraints are translated into Constraint Handling Rules of
ECLiPSe. Their propagation achieves a form of Semantic Query Optimization by
rewriting the queries and subqueries in the store in between the resolution
steps and by pruning the rewriting process. COIN has been used in various
applications of e-commerce.
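
To illustrate the kind of semantic conflict that context mediation resolves, here is a minimal sketch, assuming that a context is just a currency and a scale factor. It is not COIN's abductive procedure; the context dictionaries, the `convert` function and the exchange rate are hypothetical.

```python
def convert(value, source_ctx, receiver_ctx, rates):
    """Re-express `value` from the source context in the receiver context."""
    # Resolve the scale-factor conflict (e.g. thousands versus units).
    result = value * source_ctx["scale"] / receiver_ctx["scale"]
    # Resolve the currency conflict, if any.
    if source_ctx["currency"] != receiver_ctx["currency"]:
        result *= rates[(source_ctx["currency"], receiver_ctx["currency"])]
    return result


source_ctx = {"currency": "JPY", "scale": 1000}   # reports thousands of yen
receiver_ctx = {"currency": "USD", "scale": 1}    # expects plain dollars
rates = {("JPY", "USD"): 0.01}                    # made-up exchange rate

revenue_usd = convert(1500, source_ctx, receiver_ctx, rates)
print(revenue_usd)
```

In a mediator, such conversions would be derived from the declared contexts and inserted into the rewritten query, so that the user never states them explicitly.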



FLORID (F-LOgic Reasoning In Databases) [136] is a logic-based system for
information extraction and integration from the Web. Web exploration,
wrapping, mediation, and querying are done in a monolithic system where
F-Logic serves as a data model and a uniform declarative programming
language for Web access, data extraction, integration, and querying. The
Web and its contents are regarded as a unit, represented in an F-Logic data
model. Based on the Web structure, given by its hyperlinks, and the parse
trees of Web pages, an application-level model is generated by F-Logic rules.
For this, the F-Logic language is extended with Web access capabilities and
structured document analysis. By retaining the declarative semantics of
F-Logic also for the Web interface methods, Web data extraction can be
programmed in a clear and natural way. In particular, generic rule patterns
are presented in [142] for typical extraction, integration, and restructuring
tasks, such as handling HTML lists and tables and syntactical markup.

Web access, wrapper and mediator functionality, restructuring, and query- 
ing can be arbitrarily combined, and thus FLORID can be used both for Web 
querying and for information extraction. The combination of Web access, 
Web data extraction, and interpretation rules allows for data-driven Web 
exploration: a priori unknown Web pages can be accessed and evaluated 
dependent on previously extracted information. Equipped with suitably in- 
telligent evaluation rules, the system can explore hitherto unknown parts 
of the Web, coping with the steady growth of the Web. The practicability
of FLORID has been shown in the case study [141], where geographic
information from several sources on the Web has been integrated.

HERMES operates on the philosophy that the role of logic in applications is
more that of a facilitator for cooperation between computational models
than that of performing the actual computations itself. Logic is used
to integrate computations in classical data structures with specialized data
structures for computations, this being key to scalability.

In HERMES, external databases and software packages are assumed either to
have a legacy Application Program Interface (API) or to have one built
for them, and these sources are accessible via their API functions. HERMES
mediators contain syntax to execute operations in these packages and
to return the answer as a set. HERMES rules are expressed in logic, and
they allow us to execute boolean combinations of these API function calls.
The "base predicates" of HERMES are just such boolean combinations of API
calls; "derived predicates" may be defined (recursively or otherwise) in terms
of these "base" predicates. A HERMES query may involve accessing data
in a distributed environment from an RDBS, a logistics database, a GIS, a
route planning system and a linear optimizer. Over the years, HERMES has
been used to integrate Oracle, Ingres, and Quad-tree databases, route
planners, logistics databases, Dow Jones stock mediators, and intelligent
travel agents.



Infomaster An essential feature of Infomaster [91] as an Information Integration
framework is its emphasis on semantic information processing. Infomaster
integrates only structured information sources, i.e. sources in which the
syntactic form reflects the semantic structure (in other words, databases and
knowledge bases). This restriction enables Infomaster to process the
information in these sources in a semantic fashion; information retrieval and
distribution can be conducted on the basis of content as well as form.

The core of Infomaster is a facilitator that dynamically determines an
efficient way to answer the user's query using as few sources as necessary and
harmonizes the heterogeneities among these sources. Infomaster connects,
using wrappers, to a variety of databases, for example any SQL
database through ODBC, and to some World Wide Web sources. Infomaster
contains a model-elimination resolution theorem prover for abduction which acts
as a workhorse for the query planning process. The information sources are
described in terms of rules and constraints which are stored in a knowledge
base using Epilog, a main memory database system. Infomaster adopts
mainly the local-as-view approach, where sources are described as views of
the mediator relations. It then employs an inversion mechanism to turn
these into rules for the mediator (or query) predicates and uses its
abductive theorem prover, together with constraint solving, to extract a query
plan from this. Infomaster has been in production use on the Stanford campus
since fall 1995 and is now commercially available.



InfoSleuth The InfoSleuth system architecture [13], [160] consists of a set of
collaborating agents that work together at the request of the user. Users make
requests to InfoSleuth from a domain-independent or domain-specific
applet. Requests are made against an ontology specifying the user's domain of
interest. An ontology agent together with a broker agent provide the
basic support for enabling the agents to interconnect and intercommunicate.
Ontology agents maintain a knowledge base of the different ontologies used
for specifying requests, and return ontology information as requested. The
broker agent maintains a knowledge base of information that all the other
agents advertise about themselves, and uses this knowledge to match agents
with requested services. In this way the broker performs a form of semantic
matchmaking.
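
The advertise-and-match cycle can be sketched as follows. This is a deliberately simplified illustration, assuming exact matching on an ontology name and a service name, whereas the matchmaking described above is semantic. The class `Broker`, its methods and the agent names are hypothetical and are not InfoSleuth's interface.

```python
class Broker:
    """Keeps agent advertisements and matches them against requests."""

    def __init__(self):
        self.advertisements = []

    def advertise(self, agent, ontology, services):
        """Record what an agent says about itself."""
        self.advertisements.append((agent, ontology, frozenset(services)))

    def match(self, ontology, service):
        """Return the agents that advertise `service` over `ontology`."""
        return sorted(
            agent
            for agent, ad_ontology, services in self.advertisements
            if ad_ontology == ontology and service in services
        )


broker = Broker()
broker.advertise("resource_agent_1", "environment", {"query", "subscribe"})
broker.advertise("resource_agent_2", "genetics", {"query"})
broker.advertise("mrq_agent", "environment", {"multiresource_query"})
print(broker.match("environment", "query"))
```

A semantic broker would replace the equality tests with reasoning over the ontology, for instance accepting an agent that advertises a more general concept than the one requested.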

Within the InfoSleuth system, the agents themselves are organized into lay- 
ers, with the broker and ontology agents serving all of the other agents. At 
the lower layers several other different types of agents for processing informa- 
tion within InfoSleuth exist: User agents, Resource agents, Task Execution 
agents and Multiresource Query agents. The latter process complex queries 
that span multiple resources. They may or may not allow the query to 
include logically derived concepts as well as slots in the ontology. 

The Deductive Database system of LDL++ [203] forms a component of the
InfoSleuth system. The main function of LDL++ in this system is
implementing the articulation axioms that support the mapping between
heterogeneous schemas. The advantages of LDL++ in this context are enhanced
by the LDL++ system's ability to translate rules (including those with
aggregates) into equivalent SQL queries that are then offloaded for more
efficient execution to remote database servers [203].

InfoSleuth is used in applications for environmental data exchange, analysis
of genetic information and business intelligence.

MedLan The underlying basis of MedLan || is the framework of logic
programming extended with: (i) the possibility of partitioning the code into
separate programs, (ii) the ability of separate programs to interact, and
(iii) the ability to answer queries with respect to a combination of programs
denoted by so-called program expressions.

The possibility of structuring logic programs into modules, that may inter- 
act, allows one to implement classical mediator based architectures, where 
the low-level modules can act as wrappers for different data sources, while 
the intermediate-level modules act as mediators. The possibility of a dynamic
combination of programs by applying logic-based composition operators
allows one to construct mediators as semantic views on the data
provided by the wrappers. Among the operators for combining logic pro- 
grams, the constrain operator is of special importance for the construction 
of mediators. It allows the use of a collection of rules as a set of constraints 
over a logic program. 

The capabilities of MedLan for constructing mediators have been tested
in two application fields: semantic integration of deductive databases
and the construction of a declarative analysis level on top of a traditional
geographic information system. Currently MedLan is also being studied as a
candidate for expressing security levels for information systems.



Ontobroker The Ontobroker system [74] is a Web-based application for
ontology-based search aimed at small communities that are present on the
Internet or Internet-like networks. Ontobroker consists of a number of
languages that allow us to represent ontologies and to annotate web documents
with ontological information. It also contains a set of tools that enhance
query access and inference service in the WWW. This tool set allows us to access
information and knowledge from the web and to infer new knowledge with
an inference engine based on techniques from logic programming. It aims
to use semantic information to guide the query answering process and to
provide information that is not directly represented as facts in the WWW
but which can be derived from other facts and some background knowledge.

The Ontobroker architecture consists of three core elements: a query inter- 
face for formulating queries, an inference engine used to derive answers, and 
a webcrawler used to collect the required knowledge from the web. Each 
of these elements is accompanied by a formalization language: the query 
language for formulating queries, the representation language for specifying 
ontologies, and the annotation language for annotating web documents with 
ontological information. 

The inference engine of Ontobroker is given a formal semantics from Logic
Programming. It uses generalized logic programs that are translated further
into normal logic programs via a Lloyd-Topor transformation. Negation in
the clause body, which can be non-stratified, is interpreted via the
well-founded model semantics [190]. Standard techniques from deductive
databases, such as the bottom-up fixpoint evaluation procedure, are also used
in the implementation of Ontobroker.
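
Bottom-up fixpoint evaluation can be sketched in a few lines. The following is a naive evaluator for negation-free rules only, so it does not implement the well-founded semantics mentioned above and is not Ontobroker's engine. It assumes that variables start with an upper-case letter and that every head variable occurs in the rule body; the predicates `subclass` and `isa` are hypothetical.

```python
def match(atom, fact, subst):
    """Extend `subst` so that `atom` equals `fact`, or return None."""
    pred, args = atom
    fact_pred, fact_args = fact
    if pred != fact_pred or len(args) != len(fact_args):
        return None
    extended = dict(subst)
    for term, value in zip(args, fact_args):
        if term[:1].isupper():                      # variable
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:                         # constant mismatch
            return None
    return extended


def fixpoint(facts, rules):
    """Apply all rules to the known facts until nothing new is derived."""
    known = set(facts)
    while True:
        derived = set()
        for (head_pred, head_args), body in rules:
            substitutions = [{}]
            for atom in body:
                next_substitutions = []
                for subst in substitutions:
                    for fact in known:
                        extended = match(atom, fact, subst)
                        if extended is not None:
                            next_substitutions.append(extended)
                substitutions = next_substitutions
            for subst in substitutions:
                head = (head_pred, tuple(subst.get(a, a) for a in head_args))
                derived.add(head)
        if derived <= known:
            return known
        known |= derived


facts = {
    ("subclass", ("phd_student", "student")),
    ("subclass", ("student", "person")),
}
rules = [
    (("isa", ("X", "Y")), [("subclass", ("X", "Y"))]),
    (("isa", ("X", "Z")), [("subclass", ("X", "Y")), ("isa", ("Y", "Z"))]),
]
model = fixpoint(facts, rules)
print(("isa", ("phd_student", "person")) in model)
```

The derived fact `isa(phd_student, person)` is an instance of information that is not stored directly but follows from other facts and background rules.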

Ontobroker has recently been used to develop applications in the spirit of the
Semantic Web, such as semantic community web portals and tools for human
resource knowledge management.



PICSEL [94] is an information integration system where the mediator is based
on the logical formalism of CARIN. This formalism combines the expressive
powers of function-free Horn rules and description logics. CARIN is used as
the core logical formalism to represent both the domain of application and the
contents of information sources relevant to that domain.

The strong use of a logical formalism allows PICSEL to have a declarative
definition of the relevant concepts for describing the application domain and
the information sources. This makes it easy to take into account changes
that can occur frequently, e.g. when new sources are added, old sources are
removed, or the capabilities of existing sources are modified. Also the
formal semantics of PICSEL helps the designers to express their knowledge
in an unambiguous and rigorous way and to define precisely the
problem of answering queries with respect to it.

In PICSEL the problem of answering queries can be identified as a general
problem of inference in a logical framework. Existing well-established
techniques of this framework can help to determine decidability and complexity
results for the problem of information integration, and to design correct and
complete algorithms. These algorithms have the advantage of being generic,
i.e. not specific to the application domain or to the sources, and therefore
they can be reused in another application setting.

PICSEL has been tested with applications from the travel and tourism
domain. It is also used in electronic commerce applications where PICSEL is
employed as a tool for integrating different services.



3.1 Applications of Information Integration 

Although the area of Information Integration is relatively young, research in
this field has had a strong emphasis on applications from the very beginning.
This emphasis on practical aspects is growing with time. Much of the
existing work is grounded in the development of prototype systems and their use
for some specific application domain. Thus in many cases the research consists
of developing methodologies for information integration through principled
engineering of applications.

Applications of Information Integration can be divided into two large classes:
(a) integration of existing legacy heterogeneous databases and (b) integration
of information available over the World Wide Web. At one end of the spectrum,
information integration systems are developed for a particular domain of
interest. Examples include the Tambis [i L] project, concerned with the problem
of integrating Bioinformatics Information Sources, and the INEEL Data
Integration System [167], which has been applied to problems of environmental
restoration. The main purpose of these applications is to offer to the user an
effective decision support tool through the provision of extensive but relevant
(to the user's needs) information. In such cases there is usually a rich amount
of domain-specific knowledge that can be exploited in various ways by the
integration system, e.g. in query optimization, providing easy query formation
and mediator specification, identification and resolution of equivalences, etc.

At the other end of the application spectrum there is a fast growing class of
applications focusing on integration of Web data resources. While Web-based
integration systems usually provide generic tools, particular applications
focus on specific domains of interest like entertainment [112], flight delay
prediction [206], housing rentals [204] and Federal Government Information
System [205]. Moreover, information integration is an important enabling
technology for a wide class of electronic commerce applications (see below for
more details on this application domain). Issues like rapid development of
wrappers, flexible data models and query languages that can easily accommodate
semistructured data, and efficient query execution are crucial for such
applications.

Another emerging application domain is that of integration of simulation
results, so that we can also project information into the future [197]. This
type of application will complement existing applications, which only give a
view into the past and present, and hence address only one part of the needs of
decision-makers.

The web page http://www-db.stanford.edu/LIC/companies.html, maintained
by Gio Wiederhold, lists 41 commercial suppliers of Mediation Technology in the
United States. Some of the more commercial products include OmniConnect [208],
Data Joiner [207], Cross Access [209], and Enterworks [210].



3.2 Information Integration and E-commerce 

One important application where the need for information integration is
ever growing is that of E-commerce over the web. The number of people
that buy, sell or perform transactions on the Web is increasing at a phenomenal
rate. Electronic commerce encompasses many issues, such as finding and filtering
information, securing information, generating dynamic supply-chain links, online
configuration of products and many others.

Many systems based on agent technology [^, |139| , |153|| are already present
on the Web, BargainFinder [114] being the first agent for price comparison.
Systems of this category, often called Shopbots, can be considered as a first
attempt to link e-commerce with information integration and agent technology.
Clearly these systems perform some form of integration of information at the
different vendor sources, but as the information they return is rather limited
(mainly price, availability, shipping time etc.) there is no need for a
sophisticated query translation algorithm of the kind described earlier. Other
web-commerce agent systems like Kasbah Q| pro-actively search for products that
may be of interest to the user, while Firefly [176] is one of the early systems
that uses collaborative filtering to recommend musical albums to its user.
Advanced agent systems like AuctionBot or Kasbah are capable of negotiating on
behalf of the user [P^j . We therefore seem to be moving towards a form of
information integration that emphasizes the aspect of personalization, where
the integration of information is performed each time to suit the needs of the
particular user/buyer.

The functionality of current e-commerce agents is hindered by the fact that
information on the Web is currently in HTML format. Agents use wrappers to
extract, in a heuristic way, product or other "content". Although there exist a
number of approaches to semi-automate this process, such ad-hoc solutions do not
seem to scale. Moreover, product information heterogeneity on the semantic level
seems to be a more serious obstacle to efficient business information exchange
than information representation heterogeneity. In general, there is no agreement
on fundamental issues, such as what is included in a product domain, how to
describe products or how to structure product catalogs.

Product information heterogeneity can be tackled either by standardization or
integration [159]. Industry, realizing the importance of resolving information
heterogeneity, has launched several standardization initiatives. Some of these
develop horizontal standards (i.e. they cover all possible product areas) such
as UN/SPSC (Universal Standard Products and Services Classification code,
www.unspsc.org), while others develop vertical standards (focusing on products
of a certain type) such as RosettaNet for the area of hardware and software
products (www.rosettanet.org).

Two modern technologies, XML and Ontologies, are playing an important role
in these standardization efforts. XML has been put forward as an important tool
for tackling inter-operability problems in e-commerce P2"| . Indeed, today there is
a growing number of XML standards for e-commerce capturing different aspects
of business activities [130]. Although XML is a major step forward, it should not
be regarded as a solution to all inter-operability problems, but more like a widely
accepted layer on which to build appropriate semantic information. Although
there is no single view on how to extend XML to support greater inter-operability
in e-commerce, it seems that ontologies are becoming increasingly important as
a component in semantically rich e-commerce services, as advocated in [71],
[179], [144]. For instance, building consensual and reusable product catalogs
is nothing more than building an ontology for a certain domain. Indeed, there
is a strong interaction between the ontologies and online commerce communities.
Examples include Interprice [217] and Content Europe [218], both providing
ontology-based support for e-commerce [73].



Although standardization efforts are important, integration still remains an
issue, as "most industrial standards are not quite mature at the current stage,
and there are no apparent leaders" [130]. The proliferation of standards
threatens to create an electronic marketplace dominated by "commerce islands",
markets that have become isolated by differing proprietary protocols and domain
standards. Therefore, it is inevitable that some form of integration will be
required by future e-commerce systems [159]. Logic could play an important role
here in providing the higher levels of inter-operability needed for the
integration of information over different e-commerce markets.

Moreover, logic could contribute not only to solving product information
heterogeneity problems, but also to other aspects of e-commerce. One such aspect
is described in the recent work of [157] that uses Logic Programming to develop
a framework for integrating business rules for electronic commerce. These rules
are expressed in a generalized form of logic programs, called courteous logic
programs, a framework that incorporates a form of conflict handling. The
declarative semantics of this framework facilitates shared understanding and
inter-operability between different rules. Pilot applications in e-commerce
areas such as negotiations, catalogs and storefronts have been considered. The
framework also supports an XML encoding of courteous logic programs. A prototype
system called CommonRules is available at [219].



In addition, as the sophistication of e-commerce applications increases, we also
expect to see a stronger interaction between the Semantic Web and E-commerce
communities in the future. This interaction could reveal new roles for logic in
e-commerce applications, including the modeling of business rules [157], mentioned
above, or the specification of workflow | 23fl



3.3 Logic Programming and the Web 

In the last five years there has been a new interest in the field of Logic
Programming aiming at linking Prolog languages to the Web, e.g. the PiLLow
library for Internet/WWW Programming [34]. The main idea is to provide a
facility so that pages can be downloaded from the Web and turned into a
corresponding Prolog program containing information extracted from these web
pages. In general, this is an attempt to encode information pertaining to a web
page, which can be either information within the page or meta information about
the web page, into the declarative form of a Prolog program.
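
A minimal sketch of this "page to facts" idea is given below. It uses Python's standard `html.parser` module rather than PiLLow, whose actual predicates are not shown here; the fact names `title/2` and `link/2`, the helper functions and the example URL are assumptions made for illustration.

```python
from html.parser import HTMLParser


class PageFacts(HTMLParser):
    """Collects title and hyperlink facts about one web page."""

    def __init__(self, url):
        super().__init__()
        self.url = url
        self.facts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.facts.append(("link", self.url, href))
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.facts.append(("title", self.url, data.strip()))


def page_to_facts(url, html):
    """Turn the HTML text of a page into a list of fact tuples."""
    parser = PageFacts(url)
    parser.feed(html)
    parser.close()
    return parser.facts


def to_prolog(fact):
    """Render a fact tuple as a Prolog clause with quoted atoms."""
    name, *args = fact
    quoted = ", ".join("'" + a.replace("'", "\\'") + "'" for a in args)
    return "%s(%s)." % (name, quoted)


page = ("<html><head><title>Home</title></head>"
        "<body><a href='/a.html'>A</a></body></html>")
facts = page_to_facts("http://example.org/", page)
for fact in facts:
    print(to_prolog(fact))
```

Facts produced from several pages in this way share one vocabulary, which is what allows rules written over `title` and `link` to act as a simple mediator layer.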

One can then view this as a mechanism for information integration under
the paradigm of mediation, as these logic programs define a common mediator
layer for the various web pages retrieved from the web in this way. Analyzing
and synthesizing the information in these logic programs constitutes a primitive
form of information integration over the web page sources. PrologCrawler pi]
and ExpertFinder are two such systems that use this "web page to Prolog"
approach in order to integrate information from various web pages. Similarly,
WebLog [118] and D 3 Web [152] are Datalog-based query languages for information
held over several web pages.

The LogicWeb system [133, 55] converts web pages and their links into logic
programs with the additional possibility for a web page itself to contain LogicWeb
code. These logic programs can then be semantically composed together in several
ways, thus achieving the integration of information from the original web page
sources. Applications of LogicWeb include a citation search tool [134] and a
system for web-based guided tours [135].

These ideas of converting information from the HTML document of a web page
to a logic program, and utilizing these programs to synthesize the information
declaratively, extend from HTML web pages to XML documents [175] and RDF
descriptions [57], thus enabling a higher-level, semantic form of information
integration.



4 Challenges of CL in Information Integration 

The problem of information integration has been the subject of intensive research
activity in recent years, with most of the work concentrating on modeling
data sources in a single unified view. While significant progress has been made,
many important problems remain unsolved. Future developments in information
integration are expected to center around the inter-related research
themes presented below. In the discussion of these problems we will focus
particularly on the potential role that logic could play in addressing them.
Some of these links to logic reflect the authors' personal opinion.

Representation/Optimization: As noted earlier, most of the early work on
information integration has been carried out under the assumption that
information sources can be modeled as relational databases. Logic-based
approaches have to a large extent followed this line of research. However,
nowadays interest is shifting fast towards semistructured data. Modeling,
storing, managing and querying such data are emerging as important problems and
will receive much attention in the years to come. Moreover, the development of
standards for data representation and exchange on the Web, like XML, already
has a strong impact on data modeling for information integration. The
similarity between XML and data models like OEM, developed independently by the
DB community, will further facilitate the study of problems related to XML data
management [198]. The role of CL in modeling semi-structured data is emerging
as an important question that has received relatively little attention so far.

On the other hand, it is now clear that information integration in a dynamic
environment like the Internet will need to employ effective query optimization
and execution techniques. Richer forms of knowledge about the content and the
structure of the sources, as well as their inter-relations, are likely to be
important elements in the design of future information integration systems,
facilitating new forms of semantic query optimization. Inductive learning and data mining
techniques can further assist in the extraction of semantic knowledge for query op- 
timization. Query optimization can also be enhanced within a logical framework 
with the further use of integrity constraints by interleaving their satisfaction with 
the query reformulation process. Integrity constraints can express information 
about completeness, preferences, inclusion and other properties of data sources. 
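
As a minimal sketch of how such constraints can support semantic query
optimization, the following Python fragment prunes sources whose declared value
range cannot overlap the range requested by a query. The source names, the "year"
attribute and the helper prune_sources are hypothetical and serve only as an
illustration; a real mediator would interleave this check with the query
reformulation process.

```python
# Hypothetical source descriptions: each source declares, as an integrity
# constraint, the range of "year" values that its tuples can take.
SOURCE_YEAR_RANGE = {
    "archive_db": (1950, 1989),
    "recent_db": (1990, 2002),
    "full_db": (1900, 2002),
}


def prune_sources(source_ranges, q_low, q_high):
    """Return, in alphabetical order, the sources whose declared range
    overlaps the query range [q_low, q_high]."""
    return sorted(
        name
        for name, (low, high) in source_ranges.items()
        if low <= q_high and q_low <= high
    )


# A query restricted to the years 1995 to 2000.
selected = prune_sources(SOURCE_YEAR_RANGE, 1995, 2000)
```

In this example the query never needs to contact archive_db, since its range
constraint rules out any matching tuple.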

The main focus of the modeling methods for information integration is on pro- 
viding a mediated schema that defines the "semantics" of the underlying sources 
in the strict context of the mediated service that the global schema supports. 
Therefore, there is no need to define the intended meaning of the objects that 
make up the mediated schema. However, large scale information integration, as 
conceived for instance by the Semantic Web initiative, requires tackling data rep- 
resentation problems in a more global context, because "...we expect this data, 
while limited within an application, to be combined later with data from other 
applications" |T7| . 

Integration in the large calls for a semantically rich representation of the data 
that supports sharing, re-usability and extensibility. These requirements make
metadata and ontologies key components of information integration systems
[105, 163, 110, 195]. Metadata provides information about data, e.g. information
about database schemas and their intended meaning, which can be captured in
an ontology that defines a common "vocabulary" for describing different informa- 
tion sources and services. The intended meaning of a database schema will be 
specified in terms of an ontology, and from this ontology (possibly together with 
some inter-ontology mappings) the mediator will be able to specify, automatically 
and through reasoning, the relationship between this schema and other schemata 
defined with respect to the same or related ontologies. Similarly, the user can use 
the same or related ontology to pose queries to the sources. 
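
The idea of deriving schema relationships from a shared ontology can be sketched
as follows. This is only an illustrative assumption of how a mediator might
proceed: the tiny ontology, the attribute names and the helpers ancestors and
relate are hypothetical, and the reasoning is limited to equality and subclass
checks over an acyclic hierarchy.

```python
# Hypothetical ontology: each concept maps to its direct superclass.
ONTOLOGY_PARENT = {"Professor": "Person", "Student": "Person", "Person": None}


def ancestors(concept, parent):
    """Return the concept followed by all of its superclasses."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = parent.get(concept)
    return chain


def relate(schema_a, schema_b, parent):
    """Relate attributes of two schemas through the ontology concepts that
    annotate them. Each schema maps an attribute name to a concept."""
    relations = []
    for attr_a, concept_a in schema_a.items():
        for attr_b, concept_b in schema_b.items():
            if concept_a == concept_b:
                relations.append((attr_a, attr_b, "equivalent"))
            elif concept_b in ancestors(concept_a, parent):
                relations.append((attr_a, attr_b, "subsumed_by"))
            elif concept_a in ancestors(concept_b, parent):
                relations.append((attr_a, attr_b, "subsumes"))
    return relations


mappings = relate(
    {"prof_name": "Professor"},
    {"person": "Person", "lecturer": "Professor"},
    ONTOLOGY_PARENT,
)
```

With these inputs the attribute prof_name is found to be equivalent to lecturer
and subsumed by person, without any hand-written mapping between the two schemas.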



Several information integration systems including Information Manifold [121],
OBSERVER [?], Ontobroker [?], InfoSleuth [13, 160] and SHOE [37] make heavy
use of ontologies in their representation language. As noted earlier, ontologies
also play a crucial role in the Semantic Web architecture. Logic, and in particular
description logics, forms an integral part of ontologies as most of these systems use
them as their basic formalism for implementing ontologies. Formulating ontological
context in logic (as in [7]) and reasoning about properties of ontologies using
non-monotonic reasoning methods (e.g. default or hierarchical reasoning) seems 
a useful direction of research. 

It is also expected that there will be a growing need for more advanced forms 
of application layers that would enable specialized and advanced forms of
integration as compared with the relatively shallow integration that is carried out by
today's systems. This will require declarative and more expressive mediator and
application layers providing more flexible data models. Again the use of logic 
is one alternative for this purpose. For example, explicit negation in Logic Pro- 
gramming and constraints as in Constraint Logic Programming can be used to 
enrich the representation language. Current experience suggests that this use of 
logic will also need the use of other data models and computational methods in 
order to enhance the computational effectiveness of the overall framework. In 
such a hybrid model, logical reasoning can provide one of the main channels of 
cooperation between the different computational processes involved. 

The work on Semantic Web services [146] is characteristic of the trends in the
design of the next generation of advanced Web applications. Its main purpose is
to provide declarative Application Program Interfaces that will enable, apart from
service discovery, additional features such as automated service execution and
more automated service composition and inter-operation. The solution proposed
in [146] involves a combination of semantic markup of Web pages using DAML
languages together with an agent infrastructure that uses situation calculus and
the ConGolog agent programming language [?].

Two important technical problems that need to be addressed further irrespec- 
tive of the particular form of more expressive language representation that we 
use are the problems of incomplete information and semantic conflicts or contra- 
dictory data. The problem of incompleteness can appear either at the level of 
the data sources themselves where some information is missing or at the level of 
the description of the mediator architecture as for example with semi-structured 
data. In the latter case one issue to address is how we can use logic to describe
the partial or uncertain information and then use mechanisms that are capable of
reasoning under such incomplete information. The problem of incompleteness of
data sources in information integration was studied in [70] and shown to be related
to non-monotonic reasoning. As such the logical techniques of negation by failure,
default reasoning and abduction could be useful in addressing this problem. In
particular, constructive abduction that allows non-ground hypotheses seems to
be well suited for query reduction under incomplete information. This can give
existential answers conditional on a set of associated constraints expressing the
range of values that a missing data entry can take. 

It is inevitable that semantic conflicts or contradictory data will appear in 
information integration particularly when this is done over disparate sources over 
the WWW where there is no central control on the data available in these sources. 
The simplest form of this problem is that of naming mismatch where different
syntactic names are given to the same semantic entity. WHIRL [49] combines
techniques from information retrieval and AI to address this problem through an 
appropriate similarity measure. In this way it provides a system that is able to 
reason approximately (but with levels of confidence) with the partial information 
that it extracts from the unstructured textual form of the information sources. 
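
A similarity measure of this kind can be sketched with a standard TF-IDF
weighting and cosine similarity over name tokens. The fragment below is written
in the spirit of this approach and is not WHIRL's actual implementation; the
weighting formula, the function names and the sample names are assumptions made
for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(names):
    """Map each name to a unit-length TF-IDF vector over its lowercase tokens."""
    docs = [Counter(name.lower().split()) for name in names]
    n = len(docs)
    # Document frequency: in how many names each token occurs.
    df = Counter(tok for doc in docs for tok in doc)
    vectors = []
    for doc in docs:
        vec = {t: math.log(1 + tf) * math.log(1 + n / df[t]) for t, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * v[t] for t, w in u.items() if t in v)


def best_matches(left, right):
    """For each name in left, return its most similar name in right
    together with a confidence score between 0 and 1."""
    vecs = tfidf_vectors(left + right)
    left_vecs, right_vecs = vecs[: len(left)], vecs[len(left):]
    result = []
    for name, u in zip(left, left_vecs):
        score, match = max((cosine(u, v), r) for r, v in zip(right, right_vecs))
        result.append((name, match, round(score, 2)))
    return result


matches = best_matches(
    ["ACME Corp", "Globex Inc"],
    ["Globex Incorporated", "Acme Corporation"],
)
```

Each pair is returned with a score rather than a yes/no decision, which is what
allows approximate reasoning with levels of confidence over mismatched names.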

These two inter-related problems of incomplete information and conflict
resolution have been at the heart of many studies in Computational Logic, e.g. in
default reasoning, extended logic programming, preference reasoning and
argumentation. We can then examine ways in which these largely theoretical methods
can be adapted and used in the more practical setting of applications of Information
Integration.



Automation: For information integration systems to scale up we will need to 
automate at least some part of their development. Currently, the task of the 
description of the data sources that are available to a system is undertaken by the 
creator of the information integration system. However, when the number of sources
grows, hand-coding the mapping between the mediated schema and the sources
becomes a major bottleneck in deploying large-scale integration systems. Therefore,
there is a great need for methods and tools that assist or automate the process of 
generating source descriptions. 

There are several places where machine learning can help in the automation
of information integration. These include (i) the extraction of information
from sources as for example in the process of automatic wrapper induction
[?, 154, ?, 113, 180], (ii) learning mappings between mediator relations and
source schemas thus automatically constructing part of the central mediator
architecture [171, ?, 148, 164, 165] and (iii) extracting (additional) regularities
over the data sources and mediator relations as meta-level integrity constraints
that would be useful in the process of query planning and optimization. In many
cases, these learning tasks follow the usual concept learning paradigm of learning
a set of concepts from given training examples, e.g. learning concepts in the
mediator schema from labeled sources of information as instances of these relations.
This type of learning has been extensively studied in the field of Inductive Logic
Programming (ILP) and its methods have already begun to be used in these
problems [?, 53, 117]. The significance and potential of computational logic for these
problems is high. Indeed, the inherent relational nature of any Information
Integration framework makes the relational learning framework of ILP particularly
appropriate for the aforementioned machine learning tasks within the general task
of Information Integration.

As discussed in the previous section, the use of ontologies is an important
approach to defining large-scale mediation services. They support sharing,
reusability and extensibility, which are all important features for the rapid
development of information integration applications. It is therefore natural to
expect that the different aspects of the problem of learning ontologies, as studied
in [138, 69, 182, 183] and [161] where a review of existing approaches is given,
will become important in the future. Relational learning can play a central role
in this effort to automate the construction of the rich structure of ontology
based mediators.

Personalization: In a mediator-based information integration system two
interfaces have to be implemented, the mediator/source interface and the
user-application/mediator interface. Indeed, the overall problem of information
integration can be split into two levels:

• Interpret what the user (or application program) is asking for in her query 
(or high-level goal). 

• Answer the query (or goal) through a suitable integration of information of 
various sources available. 

While much work has been done at the second level based on a mediator
architecture that describes sources, i.e. the mediator/source interface, the problem
of modeling users/applications for information integration is less well understood.
This first level involves understanding the needs of each individual in a given
context and satisfying these needs in the best possible way. This is a complicated
problem with many different aspects. Existing user agent systems like Webwatcher
[107] and Letizia [131] assist the user in locating information on the Internet by
tracking the user's behaviour. On the information provider side, a currently popular
approach to personalization is the use of data mining techniques on Web logs
that track the user's browsing behaviour represented as a clickstream [5, 181,
151]. The discovered patterns can be used in different ways such as changing the
web structure for easier browsing, predicting future page requests, predicting user
preferences for active advertising, or making recommendations to the user.

However, the above approaches either rely heavily on user browsing, or are
restricted to individual web sites and therefore they do not adequately address all
aspects and forms of information integration. The problem of personalization in
a general information integration setting involves understanding the user (or
application) needs in a way that will allow the computation of the most appropriate
queries to the sources that would satisfy these needs. In other words, a high-level
query formation is required that is sensitive to the particular user's context and
needs. In approaches where the user formulates her queries directly in the
mediator language, it is assumed that she is familiar with the vocabulary available for
posing queries, or with the range of information that is available to the mediator
[177]. In application domains where this is not the case, the user should be assisted
in formulating her queries [?, 96].

Moreover, answering user queries in a satisfactory way involves in many cases
a level of understanding that cannot be reached without some semantic knowledge
about the data and the context of the query. Although query answering can be
seen as a process of matching the query description with the source description, in
many interesting cases simple syntactic matching is not adequate and semantic
information is required. Ontology based query formulation is a first step towards
providing some of the necessary functionality and has been employed in some of
the existing systems [110, 147, 74]. On the other hand, we expect future work
on personalization to be more tightly linked to the emerging metadata languages
RDF and RDFS [?]. In addition, current approaches to personalization such as
collaborative filtering or user clustering will evolve to accommodate semantically
richer forms of information that will become available on the Internet.



In general, effective user profiling [172] and information brokering based on
this is a task that depends heavily on the existence and use of extensive
background knowledge about the application domain and requires advanced techniques
of knowledge representation and reasoning. "If we want our computers to
understand us, we will need to equip them with adequate knowledge" [150]. Methods
from ILP for knowledge intensive learning (e.g. [170]) and the use of CL for
advanced forms of reasoning, such as default reasoning with (user) preferences and
constraints, could prove useful in this task, especially in the context of the future
Semantic Web services.



Future Forms of Mediation: Future developments will see an ever increasing
number of applications that use information integration over the Web and in
particular the emerging Semantic Web. These would range from more advanced search
engines to applications for specific domains where integration of information
relative to a user's general needs is performed pro-actively. The two phases of query
formation and query planning would then be strongly interleaved for this purpose.

As we automate more and more of the construction process of a mediator, it
evolves into a facilitator as defined in [196]. A facilitator is dynamic and responsive
to changing situations. It will be able to accept in a dynamic way meta-data
about information sources and logical statements relating disparate concepts of
an underlying ontology in order to automatically integrate a new resource into the
system. There will also be an increasing need for abstraction where the volume
of synthesized data is reduced while maintaining its essential (for the application)
information content. For example, instead of responding to a query with a set
of all answers we can use a rule or intensional answer that characterizes all the
extensional answers. Clearly, logic can help realize these characteristic features
of future mediators.
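
The notion of an intensional answer can be illustrated with a small sketch. The
data, the candidate rules and the function intensional_answer are hypothetical;
the fragment simply searches a given list of candidate rules for one that holds
for exactly the extensional answers, which is far simpler than a general
logic-based method.

```python
# Hypothetical data: (name, role, salary in thousands).
EMPLOYEES = [
    ("ann", "manager", 90),
    ("bob", "engineer", 60),
    ("eve", "manager", 85),
    ("dan", "clerk", 40),
]

# Hypothetical candidate rules, each given as (description, predicate).
RULES = [
    ("role = engineer", lambda e: e[1] == "engineer"),
    ("role = manager", lambda e: e[1] == "manager"),
]


def intensional_answer(answers, universe, candidate_rules):
    """Return the description of the first rule satisfied by exactly the
    given extensional answers, or None if no candidate rule fits."""
    answer_set = set(answers)
    for description, predicate in candidate_rules:
        if {x for x in universe if predicate(x)} == answer_set:
            return description
    return None


# Extensional answers to the query "who earns more than 80?".
extensional = [e for e in EMPLOYEES if e[2] > 80]
summary = intensional_answer(extensional, EMPLOYEES, RULES)
```

Instead of listing every matching employee, the mediator could then return the
single rule stored in summary, reducing the volume of data while keeping the
information that matters to the application.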

4.1 Multi Agent Systems for Information Integration 

Information integration on the Web presents us with different challenges that 
require scalable, flexible and extensible solutions. Recent developments in agent 
systems seem to provide a promising supporting technology for realizing large 
scale information integration solutions. The InfoSleuth approach [13, 160], and its
predecessor, the Carnot project, pioneered the use of agent technology in information
integration. In these systems CL had a significant role to play. The broker
agent, which is responsible for pairing agents seeking a service with agents
that can perform that service, is partially implemented in the logical deductive
database language LDL++ [203]. More recently, the potential role of CL-based
agents in information integration has been demonstrated by a number of works,
e.g. [54, 199].



The CL-agent approach brings together the benefits of declarative specification
and the rich level of expressiveness offered by computational logic, with a number
of other benefits derived from the agent architecture. These additional benefits
include: Reactivity: alertness to changes in the user requirements, and changes to
the location, availability and content of information sources; Interactivity of the
system with the user; and Interleaving of query planning and query plan
execution: the mediator can liaise with the information sources while it is constructing
a query plan, and before a complete plan is constructed. This has the advantage
of pruning the search for a plan as early as possible. For example, it can alert 
the planning process of the unavailability of an information source, or it can find 
values for partial queries that will substantially reduce further search. 
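
The pruning effect of interleaving can be sketched as follows. This is a
deliberately simplified illustration and not the mechanism of any particular
system: the subgoal names, the source names and the function plan_query are
hypothetical, and availability is probed through a caller-supplied function.

```python
def plan_query(subgoals, sources_for, is_available):
    """Build a plan one subgoal at a time, probing sources while planning.

    Returns a list of (subgoal, source) pairs, or None as soon as some
    subgoal has no available source, so later subgoals are never explored.
    """
    plan = []
    for goal in subgoals:
        chosen = next((s for s in sources_for[goal] if is_available(s)), None)
        if chosen is None:
            return None  # prune early: no complete plan can exist
        plan.append((goal, chosen))
    return plan


# Hypothetical catalogue of sources able to answer each subgoal.
SOURCES_FOR = {"flights": ["s1", "s2"], "hotels": ["s3"]}

# Source s1 is currently down, so the planner falls back to s2.
plan = plan_query(["flights", "hotels"], SOURCES_FOR, lambda s: s != "s1")
```

Because availability is checked while the plan is being built, a failure on an
early subgoal stops the search before any work is spent on the remaining ones.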

The approach of [199] uses the agent architecture of Kowalski and Sadri
together with techniques of abduction. It adopts the global-as-view approach, where
global relations are defined in terms of the information stored at the sources, and
incorporates the expression of functional dependencies which can be utilized in
query planning. Furthermore, it allows the expression and use of priorities amongst
information sources in terms of their reliability and degree of completeness with 
respect to items of data. 
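
To make the combination of a global-as-view definition, a functional dependency
and source priorities concrete, consider the following sketch. It does not
reproduce the abductive machinery of [199]; the relation, the source names and
the function city_of are hypothetical, and the global relation is simply assumed
to be the union of the sources with the dependency person -> city.

```python
# Hypothetical sources, each storing a fragment of the global relation
# lives_in(person, city).
SOURCES = {
    "registry": {"alice": "London"},
    "web_directory": {"alice": "Paris", "bob": "Rome"},
}

# Hypothetical priority order, most reliable source first.
PRIORITY = ["registry", "web_directory"]


def city_of(person, sources, priority):
    """Answer a lookup on the global relation.

    Because person -> city is assumed to be a functional dependency, at
    most one value is returned, and the first source in priority order
    that knows the person wins. Returns (city, source) or None.
    """
    for name in priority:
        value = sources[name].get(person)
        if value is not None:
            return value, name
    return None


answer = city_of("alice", SOURCES, PRIORITY)
```

The conflicting entries for alice are resolved in favour of the more reliable
registry, while bob is still answered from the lower-priority source because it
is the only one that holds his data.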

We expect that in the near future, as applications become more complex, a 
closer link between information integration and agent technologies will be estab- 
lished. Future information integration systems will require agents for different 
activities, over and above those for the mediator, such as those for user profiling, 
symbolic learning and provision of user-friendly interfaces. The use of logic-based 
agents for information integration paves the way to the synthesis of information
integration with other techniques to yield more powerful systems of this kind.

5 Conclusions 

The area of Information Integration is a young but fast growing field that has 
emerged from the need to exploit more fully the available data spread over
various databases of different types. It has been given a special impetus with the
appearance of the World Wide Web where the need for new research on the 
problem of integration of information spread over the web is considered to be of 
paramount importance not least because of its enormous potential for commercial 
exploitation. This development has meant that together with new database tech- 
niques there is an ever growing need for the use of AI techniques such as those of 
multi-agent systems, knowledge representation (including in particular ontologies 
and hierarchies), natural language, resolution of conflict and machine learning. In 
fact, turning this around, the problem of information integration over the Web is 
providing an excellent opportunity for a new experimental arena where AI theory
and techniques can be applied and tested.

To the extent that Computational Logic belongs (also) to AI (see the recent
book on "Logic-Based Artificial Intelligence" published after a meeting held in
Washington in June 1999 [149]), we can see that Logic will also have a role to
play in this new research area. A central role of logic in information integration
concerns the problem of specifying a mediator. This is analogous to the problem
of views in databases and, as shown early on in [191], it is possible to use logic
to formalize most if not all mediator architectures that have been proposed. In
some of these approaches logic is used explicitly to specify and to a certain extent 
implement the mediator architecture. 

Indeed, the potential usefulness of logic was realized from the very beginning 
of investigations in this problem. As the work developed to address the problem 
in a more complete way the role of logic was exposed more clearly to be that of a 
facilitator for the cooperation between the different other computational processes 
involved in an information integration system. Logical inference can be used as 
the mechanism for communication and intelligent cooperation between these pro- 
cesses. We expect that this central role of logic will be further exposed as the 
ontologies used for mediation get richer and there is more scope for reasoning 
with these ontologies. This would help to develop more advanced forms of inte- 
gration compared with the relatively shallow integration of today's frameworks 
and systems. Also logic can be used to link the advanced and specialized (domain 
specific) needs of the application to the more general (problem independent) lower 
layers of the mediator architecture. An alternative way to formalize this upper
application layer is using an algebra (see the SKC project [116]) and hence the
merits of each approach, logical or algebraic, need to be compared.

A specific problem of immediate importance is that of rationalizing
semi-structured data in the same way that logic rationalizes databases. This concerns
the use of logic to formalize directed graphs of semi-structured data in the same
way that we formalize in logic the (relational) tables of databases and their query
language. Work in this direction has already started [136, 35], opening a new
opportunity for logic-based databases. More generally, logic can help as a unifying
basis with which we can add structure to the meaningful content of web pages so
that a higher level of semantic information integration can be performed over the
Web. Linking logical inference with ontological information is an important next
development for building this vision of the Semantic Web.

Comparing the potential role of logic in Information Integration with its role 
in other areas in the past e.g. in the development of Constraint Programming, 
we see again that when we consider an application domain in its entirety the 
central role of logic is to provide a modeling environment for the problem (in our 
case the overall mediator architecture) and the link of this description to other 
computational methods needed to solve the problem. Of course, if and when some 
of these other computational processes can themselves be performed in a logical 
setting this central communicator role of logic can be implemented more tightly 
enabling more functionality. In the problem of information integration such cases 
are the use of (i) Inductive Logic Programming for the automatic generation 
of mediators, (ii) logic-based multi-agents as a framework for implementing the 
overall communication layers of a mediator architecture and (iii) Constraint Logic 
Programming to help address the scaling problem through the use of meta-level 
constraint information in query planning. 



We would like to end this survey with a quote from a recent article [18] by
Tim Berners-Lee, James Hendler and Ora Lassila: "Adding logic to the Web ...
is the task before the Semantic Web community at the moment". This statement
reveals succinctly the potential role of computational logic in the present and
future development of Information Integration.